Basics of Data Visualization and Causal Graphs
Our first steps with ggplot2 and ggdag
## The libraries we will use today
library(palmerpenguins) # To get our example's dataset
library(tidyverse) # To use dplyr functions and the pipe operator when needed
library(ggplot2) # To visualize data (this package is also loaded by library(tidyverse))
library(ggdag) # To create our DAGs
Welcome
This week’s tutorial will be divided into two broad camps.
- First, we will learn some basics of data visualization with ggplot2.
- Second, we will start our exploration of directed acyclic graphs (DAGs) for causal inference.
Introduction to ggplot2
ggplot2 is by far the most popular visualization package in R. ggplot2 implements the grammar of graphics to provide a versatile syntax for creating visuals. The underlying logic of the package relies on deconstructing the structure of graphs (if you are interested in this, you can read this article).
For the purposes of this introduction to visualization with ggplot2, what matters most is the layered nature of building plots.
Our building blocks
During this week, we will learn about the following building blocks:
- Data: the data frame, or data frames, we will use to plot
- Aesthetics: the variables we will be working with
- Geometric objects: the type of visualization
- Theme adjustments: size, text, colors, etc.
Data
The first building block for our plots is the data we intend to map. In ggplot2, we always have to specify the object where our data live. In other words, you will always have to specify a data frame, as such:
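For instance (a sketch using the penguins data we will meet below), the data layer alone produces an empty canvas, since no variables or geometric objects have been declared yet:

```r
library(ggplot2)
library(palmerpenguins) # provides the penguins data frame

# Only the data layer: ggplot2 knows where the data live,
# but nothing is drawn yet
p <- ggplot(data = penguins)
p
```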
In the future, we will see how to combine multiple data sources to build a single plot. For now, we will work under the assumption that all your data live in the same object.
Aesthetics
The second building block for our plots are the aesthetics. We need to specify the variables in the data frame we will be using and what role they play.
To do this, we will use the aes() function within the ggplot() call, after the data frame (remember to add a comma after the data frame).
Beyond your axes, you can add more aesthetics representing further dimensions of the data in the two-dimensional graphic plane, such as size, color, and fill, to name but a few.
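As a sketch with the penguins data (the variable choice here is ours, purely for illustration): mapping the axes inside aes() labels the plot but still draws no data, because no geometric object has been added.

```r
library(ggplot2)
library(palmerpenguins)

# Data, then a comma, then the aesthetics:
# x and y are mapped, but there is no geometric object yet
p <- ggplot(penguins, aes(x = flipper_length_mm, y = body_mass_g))
p
```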
Geometric objects
The third layer to render our graph is a geometric object. To add
one, we need to add a plus (+) at the end of the
initial line and state the type of geometric object we want to add, for
example, geom_point() for a scatter plot, or
geom_bar() for barplots.
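For example (again with the penguins data, purely as an illustration), adding geom_bar() as a third layer renders a barplot of species counts:

```r
library(ggplot2)
library(palmerpenguins)

# Data + aesthetics + a geometric object:
# geom_bar() counts the observations in each category of x
p <- ggplot(penguins, aes(x = species)) +
  geom_bar()
p
```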
Theme
At this point our plot may just need some final touches. We may want to fix the axes names or get rid of the default gray background. To do so, we need to add an additional layer preceded by a plus sign (+).
If we want to change the names in our axes, we can utilize the
labs() function.
We can also employ some of the pre-loaded themes, for example,
theme_minimal().
ggplot(name_of_your_df, aes(x = your_x_axis_variable, y = your_y_axis_variable)) +
  geom_point() +
  theme_minimal() +
  labs(x = "Name you want displayed",
       y = "Name you want displayed")
Our first plot
For our very first plot using ggplot2, we will use the
penguins data from last week.
We would like to create a scatterplot that illustrates the relationship between the length of a penguin’s flipper and its weight.
To do so, we need three of our building blocks: a) data, b)
aesthetics, and c) a geometric object (geom_point()).
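One way to assemble those three blocks (axis labels aside) is:

```r
library(ggplot2)
library(palmerpenguins) # our example dataset

# a) data: penguins
# b) aesthetics: flipper length on x, body mass on y
# c) geometric object: points, i.e., a scatterplot
p <- ggplot(penguins, aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_point()
p
```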
Activity 💡
Once we have our scatterplot, can you think of a way to adapt the code to:
- convey another dimension, the species of penguin, through color
- change the axes names
- render the graph with theme_minimal()
➨ Click for Answer 🔓
ggplot(penguins, aes(x = flipper_length_mm, y = body_mass_g, color = species)) +
  geom_point() +
  theme_minimal() +
  labs(x = "Length of the flipper",
       y = "Body mass (g)",
       color = "Species")
Visualizing effectively
Plotting distributions
If we are interested in plotting distributions of our data, we can leverage geometric objects, such as:
- geom_histogram(): visualizes the distribution of a single continuous variable by dividing the x axis into bins and counting the number of observations in each bin (the default is 30 bins).
- geom_density(): computes and draws a kernel density estimate, which is a smoothed version of the histogram.
- geom_bar(): renders barplots and, when plotting distributions, behaves in a very similar way to geom_histogram() (it can also be used with two dimensions).
This is a histogram presenting the weight distribution of penguins in our sample.
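The chunk behind that histogram is not shown in the rendered page; a sketch that reproduces it could look like this:

```r
library(ggplot2)
library(palmerpenguins)

# Distribution of body mass: the x axis is split into bins (30 by default)
# and the bars count the penguins that fall into each bin
p <- ggplot(penguins, aes(x = body_mass_g)) +
  geom_histogram()
p
```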
Activity 💡
Let’s adapt the code of our histogram above and explore what happens:
- add the bins = 15 argument (type different numbers)
- add fill = "#FF6666" (type “red” or “blue” instead of #FF6666)
- change the geom to _density and _bar
Plotting relationships
We can utilize graphs to explore how different variables are related. In fact, we did so before in our scatterplot. We can also use box plots and lines to show some of these relationships.
For example, this boxplot showcasing the distribution of weight by species:
ggplot(penguins, aes(x = species, y = body_mass_g)) +
geom_boxplot() +
theme_minimal() +
labs(x = "Species",
       y = "Body mass (g)")
Or this adaptation of our initial plot with a line of best fit for the observed data by each species:
ggplot(penguins, aes(x= flipper_length_mm, y = body_mass_g, color = species)) +
geom_point() +
geom_smooth(method = "lm", se = F) +
theme_minimal() +
labs(x = "Length of the flipper",
y = "Body mass (g)",
       color = "Species")
Next steps
Now that you have been introduced to some of the basics of
ggplot2, the best way to move forward is to
experiment. As we have discussed before, the R
community is very open. Perhaps you can gather some inspiration from
the Tidy Tuesday social data project in R where users explore a new
dataset each week and share their visualizations and code on Twitter
under #TidyTuesday. You can explore some of the previous visualizations
here
and try to replicate their code.
Here is a
curated list of awesome ggplot2 resources.
Directed Acyclic Graphs (DAGs)
This week we learned that directed acyclic graphs (DAGs) are very useful to express our beliefs about relationships among variables.
DAGs are compatible with the potential outcomes framework. They give us a more convenient and intuitive way of laying out causal models. Next week we will learn how they can help us develop a modeling strategy.
Today, we will focus on their structure and some DAG basics with the
ggdag package.
Creating DAGs in R
To create our DAGs in R we will use the
ggdag package.
The first thing we will need to do is to create a
dagified object. That is an object where we state our
variables and the relationships they have to each other. Once we have our DAG object, we just need to plot it with the ggdag() function.
Let’s say we want to re-create this DAG:
We would like to express the following links:
- P -> D
- D -> M
- D -> Y
- M -> Y
To do so in R with ggdag, we would use
the following syntax:
dag_object <- ggdag::dagify(variable_being_pointed_at ~ variable_pointing,
                            variable_being_pointed_at ~ variable_pointing,
                            variable_being_pointed_at ~ variable_pointing)
After this, we would just plot the DAG with ggdag().
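As a concrete sketch of the four links above (the lowercase node names are our choice, matching the coordinate snippets later in this tutorial):

```r
library(ggdag)

# Each link is written as child ~ parent:
# P -> D, D -> M, D -> Y, M -> Y
dag_object <- ggdag::dagify(d ~ p,
                            m ~ d,
                            y ~ d,
                            y ~ m)

# Plot the dagified object
dag_plot <- ggdag::ggdag(dag_object)
dag_plot
```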
Activity 💡
See what happens when you add + theme_minimal(),
+ theme_void(), or + theme_dag() to the DAG
above. What package do you think is lying behind the mappings of
ggdag?
Polishing our DAGs in R
As you may have seen, the DAG is not rendered with the nodes in the positions we want.
If you ever want to explicitly tell ggdag where to position each node, you can specify the coordinates on a Cartesian plane. Let’s take P as the point (0,0):
coord_dag <- list(
x = c(p = 0, d = 1, m = 2, y = 3),
y = c(p = 0, d = 0, m = 1, y = 0)
)
our_dag <- ggdag::dagify(d ~ p,
m ~ d,
y ~ d,
y ~ m,
coords = coord_dag)
ggdag::ggdag(our_dag) + theme_void()
More complex example:
Let’s say we’re looking at the relationship between smoking and cardiac arrest. We might assume that smoking causes changes in cholesterol, which causes cardiac arrest:
smoking_ca_dag <- ggdag::dagify(cardiacarrest ~ cholesterol,
cholesterol ~ smoking + weight,
smoking ~ unhealthy,
weight ~ unhealthy,
labels = c("cardiacarrest" = "Cardiac\n Arrest",
"smoking" = "Smoking",
"cholesterol" = "Cholesterol",
"unhealthy" = "Unhealthy\n Lifestyle",
"weight" = "Weight"),
latent = "unhealthy",
exposure = "smoking",
outcome = "cardiacarrest")
ggdag::ggdag(smoking_ca_dag, # the dag object we created
text = FALSE, # this means the original names won't be shown
use_labels = "label") + # instead use the new names
  theme_void()
In this example, we:
- Used more meaningful variable names
- Created a variable that was the result of two variables vs. just one (cholesterol)
- Used the “labels” argument to rename our variables (this is useful if your desired final variable name is more than one word)
- Specified which variables are latent, the exposure, and outcome
Common DAG path structures
Let’s start with a DAG in which D affects Y both directly and through X, which acts as a mediator:
coord_dag <- list(
x = c(d = 0, x = 1, y = 2),
y = c(d = 0, x = 1, y = 0)
)
our_dag <- ggdag::dagify(x ~ d,
y ~ d,
y ~ x,
coords = coord_dag)
ggdag::ggdag(our_dag) + theme_void()
Activity 💡
Let’s adapt the code to make X a confounder and then a collider.
➨ Click for Answer 🔓
Confounder:
confounder_dag <- ggdag::dagify(d ~ x,
y ~ d,
y ~ x,
coords = coord_dag)
ggdag::ggdag(confounder_dag) + theme_void()
Collider:
collider_dag <- ggdag::dagify(x ~ d,
y ~ d,
x ~ y,
coords = coord_dag)
ggdag::ggdag(collider_dag) + theme_void()
DAGs and modeling (a brief introduction)
As we can remember from our lecture, there is a set of key rules for understanding how to employ DAGs to guide our modeling strategy.
- A path is open or unblocked at non-colliders (confounders or mediators)
- A path is (naturally) blocked at colliders
- An open path induces a statistical association between two variables
- Absence of an open path implies statistical independence
- Two variables are d-connected if there is an open path between them
- Two variables are d-separated if the path between them is blocked
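Because dagify() returns a dagitty object, these rules can be checked programmatically with the dagitty package. Here is a sketch assuming a simple chain D → X → Y, where X is a mediator (this DAG is our own example, built for illustration):

```r
library(ggdag)    # for dagify()
library(dagitty)  # for dseparated()

# A simple chain: D -> X -> Y, with X as a mediator
chain_dag <- ggdag::dagify(x ~ d,
                           y ~ x)

# Conditioning on nothing: the path D -> X -> Y is open,
# so D and Y are d-connected (dseparated returns FALSE)
open_path <- dagitty::dseparated(chain_dag, "d", "y", list())

# Conditioning on the mediator X blocks the path,
# so D and Y become d-separated (dseparated returns TRUE)
blocked <- dagitty::dseparated(chain_dag, "d", "y", "x")
```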
Here it is once again, because it is very important
In this portion of the tutorial, we will demonstrate how different biases come into play when we model our relationships of interest.
The case for beauty, talent, and celebrity (What happens when we control for a collider?)
As showcased in our DAG, we assume that earning celebrity status is a function of an individual’s beauty and talent.
We will simulate 1000 data points that reflect these assumptions. In our world, someone gains celebrity status if the sum of their units of beauty and talent is greater than 8.
# beauty - 1000 observations with mean 5 units of beauty and sd 1.5 (arbitrary scale)
beauty <- rnorm(1000, 5, 1.5)
# talent - 1000 observations with mean 3 units of talent and sd 1 (arbitrary scale)
talent <- rnorm(1000, 3, 1)
# celebrity - binary
celebrity_status <- ifelse(beauty + talent > 8, "Celebrity" , "Not Celebrity") # celebrity if the sum of units are greater than 8
celebrity_df <- dplyr::tibble(beauty, talent, celebrity_status) # we make a df with our values
head(celebrity_df, 10)
## # A tibble: 10 × 3
## beauty talent celebrity_status
## <dbl> <dbl> <chr>
## 1 4.00 3.47 Not Celebrity
## 2 4.17 4.73 Celebrity
## 3 4.70 2.61 Not Celebrity
## 4 5.01 2.01 Not Celebrity
## 5 3.21 1.32 Not Celebrity
## 6 6.22 4.27 Celebrity
## 7 5.84 1.57 Not Celebrity
## 8 2.32 3.34 Not Celebrity
## 9 2.49 2.66 Not Celebrity
## 10 4.36 2.97 Not Celebrity
In this case, as our simulation suggests, we have a collider structure. We can see that celebrity status can be a function of beauty or talent. Also, we can infer from the way we defined the variables that beauty and talent are d-separated (i.e., the path between them is closed because celebrity is a collider).
Say you are interested in researching the relationship between beauty and talent for your Master’s thesis. While doing your literature review, you encounter a series of papers that find a negative relationship between the two and state that more beautiful people tend to be less talented. The model that these teams of researchers used was the following:
\[Y_{Talent} = \beta_0 + \beta_1Beauty + \beta_2Celebrity\]
Your scientific hunch makes you believe that celebrity is a collider and that by controlling for it in their models, the researchers are inducing collider bias, or endogenous bias. You decide to move forward with your thesis by laying out a criticism of previous work in the field, given that you consider their model specification erroneous. You utilize the same data previous papers used but, based on your logic, you do not control for celebrity status. This is what you find:
True model
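The chunk behind the output below is not shown in the rendered page; judging from the Call line, it was along these lines. The data are re-simulated here so the chunk stands alone; the seed is our assumption, so the exact estimates will differ slightly from the output below.

```r
# Re-create the simulated data (the seed is an arbitrary choice)
set.seed(1)
beauty <- rnorm(1000, 5, 1.5)
talent <- rnorm(1000, 3, 1)
celebrity_status <- ifelse(beauty + talent > 8, "Celebrity", "Not Celebrity")
celebrity_df <- data.frame(beauty, talent, celebrity_status)

# True model: talent regressed on beauty only, leaving the collider out
true_model_celebrity <- lm(talent ~ beauty, data = celebrity_df)
summary(true_model_celebrity)
```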
##
## Call:
## lm(formula = talent ~ beauty, data = celebrity_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.1212 -0.6345 0.0028 0.6685 3.4793
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.86566 0.11287 25.390 <2e-16 ***
## beauty 0.01951 0.02172 0.898 0.369
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.014 on 998 degrees of freedom
## Multiple R-squared: 0.0008074, Adjusted R-squared: -0.0001938
## F-statistic: 0.8064 on 1 and 998 DF, p-value: 0.3694
ggplot(celebrity_df, aes(x=beauty,
y=talent)) +
geom_point() +
geom_smooth(method = "lm", se = F) +
theme_minimal() +
theme(legend.position = "bottom") +
labs(x = "Beauty",
       y = "Talent")
Biased model from previous literature
Let’s see:
biased_model_celebrity <- lm(talent ~ beauty + celebrity_status, data = celebrity_df)
summary(biased_model_celebrity)
##
## Call:
## lm(formula = talent ~ beauty + celebrity_status, data = celebrity_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.82834 -0.49080 0.02614 0.51531 2.77059
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.44487 0.14274 38.15 <2e-16 ***
## beauty -0.33580 0.02314 -14.51 <2e-16 ***
## celebrity_statusNot Celebrity -1.59909 0.06833 -23.40 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.8152 on 997 degrees of freedom
## Multiple R-squared: 0.3551, Adjusted R-squared: 0.3538
## F-statistic: 274.5 on 2 and 997 DF, p-value: < 2.2e-16
ggplot(celebrity_df, aes(x=beauty, y=talent, color = celebrity_status)) +
geom_point() +
geom_smooth(method = "lm", se = F) +
theme_minimal() +
theme(legend.position = "bottom") +
labs(x = "Beauty",
y = "Talent",
       color = "")
As we can see, by controlling for a collider, the previous literature was inducing a non-existent association between beauty and talent, also known as collider or endogenous bias.
Shoe size and salary (What happens when we fail to control for a confounder?)
# sex - replicate male and female 500 times each
sex <- rep(c("Male", "Female"), each = 500)
# shoe size - random number with mean 38 and sd 2, plus 4 if the observation is male
shoesize <- rnorm(1000, 38, 2) + (4 * as.numeric(sex == "Male"))
# salary - a random number with mean 25 and sd 2, plus 5 if the observation is male
salary <- rnorm(1000, 25, 2) + (5 * as.numeric(sex == "Male"))
salary_df <- dplyr::tibble(sex, shoesize, salary)
head(salary_df, 10)
## # A tibble: 10 × 3
## sex shoesize salary
## <chr> <dbl> <dbl>
## 1 Male 40.9 26.8
## 2 Male 39.6 30.6
## 3 Male 41.7 29.6
## 4 Male 41.4 32.8
## 5 Male 42.3 29.7
## 6 Male 41.9 28.2
## 7 Male 45.1 28.5
## 8 Male 45.4 32.0
## 9 Male 43.0 30.7
## 10 Male 43.2 32.8
Say now one of your peers tells you about a new study that suggests that shoe size affects an individual’s salary. You are a bit skeptical and read it. The model that these researchers apply is the following:
\[Y_{Salary} = \beta_0 + \beta_1ShoeSize\]
Your scientific hunch makes you believe that this relationship could be confounded by the sex of the respondent. You think that by failing to control for sex in their models, the researchers are inducing omitted variable bias. You decide to open their replication files and control for sex. This is what you find:
\[Y_{Salary} = \beta_0 + \beta_1ShoeSize + \beta_2Sex\]
True model
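The chunk behind the output below is hidden in the rendered page; from the Call line, it was presumably along these lines. We re-simulate the data so the chunk stands alone; the seed is our assumption, so the exact estimates will differ slightly.

```r
# Re-create the simulated data (the seed is an arbitrary choice)
set.seed(1)
sex <- rep(c("Male", "Female"), each = 500)
shoesize <- rnorm(1000, 38, 2) + (4 * as.numeric(sex == "Male"))
salary <- rnorm(1000, 25, 2) + (5 * as.numeric(sex == "Male"))
salary_df <- data.frame(sex, shoesize, salary)

# True model: sex is included, so the confounding path is blocked
true_model_salary <- lm(salary ~ shoesize + sex, data = salary_df)
summary(true_model_salary)
```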
##
## Call:
## lm(formula = salary ~ shoesize + sex, data = salary_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.7120 -1.3442 0.0303 1.4450 6.0751
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 24.53368 1.16975 20.973 <2e-16 ***
## shoesize 0.01035 0.03084 0.336 0.737
## sexMale 5.05927 0.17977 28.144 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.021 on 997 degrees of freedom
## Multiple R-squared: 0.6151, Adjusted R-squared: 0.6143
## F-statistic: 796.5 on 2 and 997 DF, p-value: < 2.2e-16
ggplot(salary_df, aes(x=shoesize, y=salary, color = sex)) +
geom_point() +
geom_smooth(method = "lm", se = F) +
theme_minimal() +
theme(legend.position = "bottom") +
labs(x = "Shoe size",
y = "Salary",
       color = "")
Biased model from previous literature
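Again, the chunk behind the output below is hidden; from the Call line, it was presumably the following (re-simulated here with an assumed seed, so exact numbers will differ):

```r
# Re-create the simulated data (the seed is an arbitrary choice)
set.seed(1)
sex <- rep(c("Male", "Female"), each = 500)
shoesize <- rnorm(1000, 38, 2) + (4 * as.numeric(sex == "Male"))
salary <- rnorm(1000, 25, 2) + (5 * as.numeric(sex == "Male"))
salary_df <- data.frame(sex, shoesize, salary)

# Biased model: sex is omitted, so shoe size picks up the sex effect
biased_model_salary <- lm(salary ~ shoesize, data = salary_df)
summary(biased_model_salary)
```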
##
## Call:
## lm(formula = salary ~ shoesize, data = salary_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.7350 -1.9593 0.0689 1.9028 7.4865
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.73462 1.17364 2.33 0.02 *
## shoesize 0.62059 0.02936 21.14 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.706 on 998 degrees of freedom
## Multiple R-squared: 0.3092, Adjusted R-squared: 0.3085
## F-statistic: 446.8 on 1 and 998 DF, p-value: < 2.2e-16
ggplot(salary_df, aes(x=shoesize, y=salary)) +
geom_point() +
geom_smooth(method = "lm", se = F) +
theme_minimal() +
labs(x = "Shoe size",
       y = "Salary")
As we can see, by failing to control for a confounder, the previous literature was creating a non-existent association between shoe size and salary, incurring omitted variable bias.
Acknowledgements
This tutorial is based largely on chapters 7 to 10 from the QPOLR book.
This script was drafted by Lisa Oswald and Sebastian Ramirez-Ruiz, with contributions by Adelaida Barrera Daza, Carolina Díaz Suárez, Sofía García-Durrer, and William Fernandez.